
To all whom it may concern: 

Be it known that we, Youssef (NMI) Drissi, Moon Ju Kim, Lev (NMI) Kozakov and Juan
(NMI) Leon Rodriguez, citizens of Morocco, United States of America, Israel and Mexico,
respectively, residing in the states of New York, New York, Connecticut and New York,
respectively, have invented new and useful improvements in

METHOD AND SYSTEM FOR SEARCHING A MULTI-LINGUAL DATABASE

of which the following is a SPECIFICATION:

Background of the Invention

Field of the Invention 

The present invention relates to the field of searching a database using search term(s)
entered by a user. More particularly, the present invention is a system and method for searching
a database including material in different languages, where the search term(s) are entered in
one of the languages and where the database need not be translated into the different languages.

Background Art

Various methods have been proposed for searching a database wherein the database
includes material in multiple languages. One approach is to translate the entire database into the
language in which a search term is entered or the language of the user. However, this could
involve a large amount of translation for a sizable database (and multiple translations if the
database is used by users in different languages). Further, each process of translating a document
has the potential for losing (or distorting) some of the meaning of the original text.

For these reasons, it is desirable to avoid translating the documents to allow for a search
in a particular language.

Another approach is to use a synonym list and apply it to the search term(s) entered in one
language. That is, the text of the documents in the database remains in the original language and
synonyms in each language for each search term are used for the search of the database. This
system may work in some cases but is undesirable in other cases because considering all of the
synonyms in the different languages could lead to incorrect results. The word for "network" in
Spanish is "red", and a search on "network" which blindly translates the search term would
incorrectly find English documents which include the color "red".
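
The pitfall described above can be illustrated with a minimal sketch. The synonym list, the toy documents and the function name `naive_search` are all hypothetical and are not taken from any prior art system; the sketch only assumes that the synonym expansion discards the language of each synonym.

```python
# Hypothetical synonym list: maps an English search term to its
# counterparts in other languages. No language tag is retained.
SYNONYMS = {"network": ["red"]}

# Toy documents, each tagged with the language it is written in.
DOCUMENTS = {
    "doc_en_1": ("en", "the network adapter failed"),
    "doc_en_2": ("en", "paint the door red"),
    "doc_es_1": ("es", "la red local no responde"),
}

def naive_search(term):
    """Match the term and all of its synonyms against every document,
    ignoring the language each synonym belongs to."""
    wanted = {term, *SYNONYMS.get(term, [])}
    return sorted(
        doc_id
        for doc_id, (_lang, text) in DOCUMENTS.items()
        if wanted & set(text.split())
    )

hits = naive_search("network")
```

In this sketch the English document about painting a door is returned for a search on "network", because the Spanish synonym "red" is matched against English text.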

Further, some of the documents include text in one language and key words presented in
a different language to avoid changing the meaning. Thus, it is desirable to search a database
which includes these terms, but it would not be effective to search only for the translated form of
the word.

As will be apparent to one skilled in the relevant art, the process of translating and
searching in multiple languages can consume substantial computing resources. Many of the
multi-language database searching techniques require a powerful computer or take an inordinate
amount of time to process a single search, the amount depending on the size of the database, the
number of supported languages and the nature of the queries. However, the computing resources
have a cost associated with them, either in requiring a larger or faster system or in terms of tying
up the computer while a large task is running to the exclusion of other users. Further, a search
which takes a long period of time may prevent the user from interactively modifying the search
to obtain meaningful results. Accordingly, it is desirable to avoid using large computing
resources.

Accordingly, existing systems and methods for searching databases have undesirable
disadvantages and limitations which will be apparent to those skilled in the art in view of the
following description of the present invention.

Summary of the Invention

The present invention overcomes the disadvantages and limitations of the prior art
systems by providing a simple, yet effective, method and system for searching a database
including documents in multiple supported languages. The present invention also supports
searching a database in which the text is comprised of documents written in multiple languages,
including those documents which are written in one language but which include words or
phrases from a second language.

The present invention has the advantage that a translation of the documents in the
database into each of the supported languages is not required.

The present invention also has the advantage that the meaning of the original document is
not lost or distorted through a translation process to allow searching of the document in different
languages.

The present invention also allows for the searching of a database in a native or natural
language while finding documents which are written in other languages.

Other objects and advantages of the system and method of the present invention will be
apparent to those skilled in the relevant art, in view of the following description of the preferred
embodiment, taken together with the accompanying drawings and the appended claims.



Brief Description of the Drawings

Having thus described some of the objects and advantages of the present invention, other
objects and advantages will be apparent to those skilled in the art in view of the following
description of the invention taken in conjunction with the accompanying drawings in which:







Fig. 1 is a diagrammatic view of a traditional search technique in which documents exist
in two different languages;

Fig. 2 is a diagrammatic view of an improved multi-lingual document
database index system of the present invention;

Fig. 3 is a dual language (or multi-language) database search system of the present 
invention; 

Fig. 4 is a flow chart illustrating sample logic performed in practicing the present 
invention; and 

Fig. 5 is a synonym table of the type which is useful in carrying out the present invention
as described in connection with Figs. 2-4, associating a word in one language with its counterpart
in another language.



Detailed Description of the Preferred Embodiment

In the following description of the preferred embodiment, the best implementation of
practicing the invention presently known to the inventor will be described with some
particularity. However, this description is intended as a broad, general teaching of the concepts
of the present invention describing a specific embodiment, and is not intended to limit the
present invention to that shown in this embodiment, especially since those skilled in the
relevant art will recognize many variations and changes to the specific structure and operation
shown and described with respect to these figures.

Fig. 1 illustrates a traditional search system, that is, one of the prior art, in which
documents in English (a first language) are represented by the symbol 102 and documents in a
second language such as a national language (NL) are represented by the symbol 122. While
each set of documents is maintained separately, each is indexed through a process of extracting
the keywords and creating an index, represented by the box 104 for the English documents 102
and the box 124 for the second language documents 122. The next step is that an inverted index
is created for each set of documents, the English inverted index at block 106 and the second
language index represented by block 126. Then, a search or query is formatted and applied
against a selected one of the databases, represented by an English query at 108 and a national
language query at block 128. The results of the English query are shown by block 110 and the
results of a national language query are represented by the box 130. Thus, the steps of the
process are carried out separately for each database and include indexing the documents at
block 112, creating an inverted index at block 114, and conducting a search and providing an
output at block 116. While the steps are the same regardless of which type of database is used,
each database is kept separate, each is searched separately and each generates separate
results. Since this same structure could be applied to any number of separate databases, this
system could expand to support the number of languages desired.

However, some technical documents are written in a native language (such as Spanish)
but use technical terms from another language (for example, from English). In such a system,
searching the national language database for the national language equivalent of a search term
will not find the search term if it is included in the document in another language.

Fig. 2 illustrates a system for merging documents in different languages into a single
index. As shown in this Figure, documents in a first language (English) are represented by the
symbol 202 and documents in a second language (a national language) are represented by the
symbol 204. Keywords are identified from each document in a step 206, then translated into
each supported language at block 208. Separate indices 210, 212 in each language are created
from the translated keywords. Then, an inverted index 214 is created from the translated
keywords. The translation of keywords is preferably accomplished using a keyword dictionary
220 which includes words in English associated with the corresponding keywords in the national
language (and vice versa) to form a synonym listing which effectively translates a keyword in
one language into the corresponding term in another language (and vice versa). This listing of
synonyms accomplishes the translation of keywords in the creation of the indices and for later
searching as will be described in connection with Figure 3. In order to manage various
languages, it is proposed to represent each term using the Unicode system (UTF-8), although any
other system which is accurate and consistent could also be used to advantage in the present
invention.

Thus, the process of creating an inverted index involves steps of creating in block 232 an
index in each language and creating a merged inverted index in block 234 using the keyword
dictionary 220 which includes synonyms in each supported language. While two languages are
shown in the figures of the present invention, the present invention can easily be expanded to
support the desired number of languages, and, while English is described as one language for the
documents and for the searches, the present invention is not limited to serving documents in
English and another language could be substituted, if desired.
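
The indexing steps of Fig. 2 might be sketched as follows. This is only an illustration under stated assumptions: the dictionary contents, the toy documents, the two-letter language codes and the function names `extract_keywords` and `build_merged_index` are hypothetical, and keyword identification is reduced to a lookup in the keyword dictionary.

```python
from collections import defaultdict

# Hypothetical keyword dictionary (220): (language, keyword) -> counterparts.
KEYWORD_DICTIONARY = {
    ("en", "network"): {"es": "red"},
    ("es", "red"): {"en": "network"},
    ("en", "processor"): {"es": "procesador"},
    ("es", "procesador"): {"en": "processor"},
}

# Toy documents: id -> (language, text).
DOCUMENTS = {
    "en_1": ("en", "network processor design"),
    "es_1": ("es", "la red usa un procesador"),
}

def extract_keywords(lang, text):
    """Keep only the words known to the keyword dictionary (step 206)."""
    return [w for w in text.lower().split() if (lang, w) in KEYWORD_DICTIONARY]

def build_merged_index(documents):
    """Return {(language, keyword): set of document ids}. Each keyword is
    posted under its own language and under every translation (208, 234)."""
    index = defaultdict(set)
    for doc_id, (lang, text) in documents.items():
        for word in extract_keywords(lang, text):
            index[(lang, word)].add(doc_id)
            for other_lang, translated in KEYWORD_DICTIONARY[(lang, word)].items():
                index[(other_lang, translated)].add(doc_id)
    return dict(index)

MERGED_INDEX = build_merged_index(DOCUMENTS)
```

Keying each posting by a (language, keyword) pair keeps the Spanish "red" separate from any English "red", and only keywords are translated, never the documents themselves.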

Fig. 3 illustrates a search system of the type which is useful in the present invention. A
query is input at block 310, then passed to a keyword dictionary represented by block 320. The
keyword dictionary 320 includes a bi-directional translation system which translates keywords
from the English (or first) language 322 to a national (or second) language 324 and vice versa,
using, in its preferred embodiment, a stored synonym list in the form of a bi-directional table
such as is illustrated and described later, particularly in connection with Fig. 5. The synonym
table is designed to support a plurality of languages and allow translation between the supported
languages. The result is a pair of queries, one query 330 in the first language (e.g., English) and a
second query 340 in a second language (such as the national language). The English language
query 330 is applied against both the English inverted index 334 and the national language index
344, and the national language query 340 is applied against the national language index 344, to
generate results: an English-language hitlist 338 and a national language hitlist 348. The user
then can select (represented by the box 350) which results are of interest to the user, at least to
start the process, since it is possible that the user will select one, determine that it is
inappropriate and try another selection. If the user has limited capabilities in understanding
English, he may prefer to look at the results 348 in the national language. If the national
language results 348 are not sufficient (or nonexistent), then he may go on to the English
language results 338. In the alternative, the user may recognize that the results of interest are
most likely to be the English results 338 and may start with those results. In another alternative,
the user finds so many results in English that he decides to review the more selective list in his
national language.
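
A minimal sketch of the query side of Fig. 3 follows. The table contents, index contents and function names are hypothetical. As a simplifying assumption, each query is applied only to the index of its own language, which is narrower than what the text above describes for the English query, and a query matches a document only if every keyword is present.

```python
# Hypothetical bi-directional synonym table (keyword dictionary 320).
PAIRS = [("network", "red"), ("processor", "procesador")]
EN_TO_NL = dict(PAIRS)
NL_TO_EN = {nl: en for en, nl in PAIRS}

# Hypothetical per-language inverted indices (334 and 344).
EN_INDEX = {"network": {"en_1", "en_2"}, "processor": {"en_1"}}
NL_INDEX = {"red": {"es_1"}, "procesador": {"es_1", "es_2"}}

def translate_query(keywords, query_lang):
    """Return (english_query, national_query) for the given keywords.
    Keywords missing from the table are passed through untranslated."""
    if query_lang == "en":
        return list(keywords), [EN_TO_NL.get(k, k) for k in keywords]
    return [NL_TO_EN.get(k, k) for k in keywords], list(keywords)

def search(index, query):
    """Return the sorted ids of documents containing every query keyword."""
    postings = [index.get(k, set()) for k in query]
    return sorted(set.intersection(*postings)) if postings else []

def run(keywords, query_lang):
    """Produce one hitlist per language (338 and 348)."""
    en_query, nl_query = translate_query(keywords, query_lang)
    return {"en": search(EN_INDEX, en_query), "nl": search(NL_INDEX, nl_query)}

hitlists = run(["red", "procesador"], "nl")
```

The two hitlists are kept separate rather than merged, so that the user can choose which list to review first, as represented by box 350.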

Fig. 4 illustrates a flow chart of one process of practicing the present invention. As shown
in this Fig. 4, the process begins with a determination of the language of the user and whether the
user wishes to limit his universe to documents written in his native language. The first step is to
determine the language of the user at block 410. Perhaps the user has entered his native or
national language or perhaps it is determined from his entries, such as a query in a given
language. Then, at block 420 the query is parsed into keywords. Those keywords
are translated at block 430 and the queries produced are submitted to the searching mechanism
at block 440. Results are obtained at the block 450 and a set of results may be selected at block
460.
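
The flow of Fig. 4 could be sketched as below, leaving out the search itself (blocks 440 and 450). Everything here is an assumption made for illustration: the stop-word lists, the word-counting language guess, the two-language table and the function names do not come from the description, which does not specify how the language is determined or how the query is parsed.

```python
# Hypothetical data: stop words per language and a tiny synonym table.
STOP_WORDS = {"en": {"the", "a", "of"}, "es": {"la", "el", "de", "un"}}
TABLE = {"network": "red", "processor": "procesador"}  # English -> Spanish
REVERSE = {es: en for en, es in TABLE.items()}

def determine_language(query, declared=None):
    """Block 410: use the declared language if given, otherwise guess
    from which language's known words appear in the query."""
    if declared:
        return declared
    words = set(query.lower().split())
    es_score = len(words & (STOP_WORDS["es"] | set(REVERSE)))
    en_score = len(words & (STOP_WORDS["en"] | set(TABLE)))
    return "es" if es_score > en_score else "en"

def parse_keywords(query, lang):
    """Block 420: drop stop words and keep the remaining terms."""
    return [w for w in query.lower().split() if w not in STOP_WORDS[lang]]

def translate(keywords, lang):
    """Block 430: build one query per supported language."""
    mapping = TABLE if lang == "en" else REVERSE
    other = "es" if lang == "en" else "en"
    return {lang: keywords, other: [mapping.get(k, k) for k in keywords]}

def select_results(results, user_lang):
    """Block 460: prefer the user's language, then fall back to any
    other language that produced hits."""
    if results.get(user_lang):
        return user_lang, results[user_lang]
    for lang, hits in results.items():
        if hits:
            return lang, hits
    return user_lang, []

lang = determine_language("la red de un procesador")
queries = translate(parse_keywords("la red de un procesador", lang), lang)
```

The fallback in `select_results` mirrors the case where the results in the user's own language are nonexistent and the user goes on to the results in the other language.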

In Fig. 5, a portion of a synonym table is shown by the reference numeral 500. The table
includes a plurality of columns, each associated with a different language. In Fig. 5 as shown,
these supported languages are English in column 510, Spanish in column 520, French in column
530 and Italian in column 540. An additional column 550 is shown provided for another
language such as German or Japanese, recognizing, of course, that some languages have
different types of characters from English and some languages have so many different symbols
that it may be necessary to use a double byte character set to represent some of such languages,
like Japanese. Two sets of synonyms are shown in rows in this Fig. 5, one associated with the
English word "network" in row 560 and one associated with the English word "processor" in row
570. In practice, the synonym table 500 may have additional columns as desired, as shown by the
symbol 590 (or may have fewer columns if fewer languages are supported; the selection of
supported languages is a matter of design choice and not a feature of the present invention), and
will have a row for each keyword, shown by the symbol 580. It is important to note that each
entry is associated with a language so that it is possible to associate a word with its language and
distinguish the Spanish word for network (red) from the English word for the color red,
if desired. While the table is shown in tabular form for ease in understanding the concept of a
synonym table, the table may exist in other known formats in storage according to conventional
data processing techniques.
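
One possible in-memory form of such a synonym table is sketched below. The language codes, the function names and the French and Italian entries are illustrative assumptions; the description itself names only the English words "network" and "processor" and the Spanish word "red".

```python
from typing import Dict, List, Optional

LANGUAGES = ["en", "es", "fr", "it"]  # columns 510, 520, 530, 540

# One row per keyword (rows 560 and 570). Each entry is stored under the
# column of its language, so a word is always paired with its language.
SYNONYM_TABLE: List[Dict[str, str]] = [
    {"en": "network", "es": "red", "fr": "réseau", "it": "rete"},
    {"en": "processor", "es": "procesador", "fr": "processeur", "it": "processore"},
]

def lookup(word: str, lang: str) -> Optional[Dict[str, str]]:
    """Return the row holding `word` in the column for `lang`, or None."""
    for row in SYNONYM_TABLE:
        if row.get(lang) == word:
            return row
    return None

def translate(word: str, source: str, target: str) -> Optional[str]:
    """Translate a keyword between two supported languages, or return
    None if the word is not a keyword in the source language."""
    row = lookup(word, source)
    return row.get(target) if row else None

spanish_red = translate("red", "es", "en")
english_red = translate("red", "en", "es")
```

Because the lookup is always made in the column of a specific language, "red" resolves to "network" only when it is treated as a Spanish word; as an English word it is not a keyword in this table and yields no translation.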

The present invention, it will be recognized, is especially adapted for use in a data
processing system such as a general purpose computer with a stored program containing
computer program means including a plurality of instructions. Those instructions will generally
be written in a high level language which is readable by a human and translated into machine
language, that is, simple instructions which are understood by the data processing system. In an
appropriate instance such instructions could be directly written in a machine language
programming language, if desired, a system which allows for efficiency of execution but which
is more difficult to program. The present invention is not limited to any particular input
language.

As used in the present document, software, computer program and computer program
means are used interchangeably. Software in the present context means any expression, in any
language, code or notation, of a set of instructions intended to cause a system having an
information processing capability to perform a particular function either directly or after either
or both of the following: a) conversion to another language, code or notation; b) reproduction in a
different material form. The use of the Unicode system for managing different languages has
been used in the description of the preferred embodiment but other suitable methods for
representing different languages could also be used to advantage in the present invention, if
desired.



The term national language has been used to represent a language associated with a user
of the system. This language could be any language supported by the system, and might include
different languages for different users. So, "national language" might represent Spanish for a
Mexican or a person from Spain and might represent French for a person from France or other
French-speaking locales. Appropriate synonym tables are available for a variety of common
languages, as are systems for locating key words and separating common text with little
uniqueness from key words which are descriptive of the document under consideration. Such key
word locating systems are often technologically directed and identify words which are of interest
to the technology under consideration.

Of course, many modifications of the present invention will be apparent to those skilled
in the relevant art in view of the foregoing description of the preferred embodiment, taken
together with the accompanying drawings and the appended claims. For example, the present
invention has been described in connection with documents and searches in English and in a
national language, whereas the number of supported languages need not be two and there need
not be a single national language. Further, in some circumstances, the documents could be written
in a combination of supported languages. Additionally, some elements of the present invention can
be used to advantage without the corresponding use of other elements. For example, the use of
the synonym or keyword dictionary is not the only way to accomplish the translation of
keywords into other languages. Further, various other devices could be substituted to advantage
depending on the environmental circumstances. Accordingly, the foregoing description of the
preferred embodiment should be considered as merely illustrative of the principles of the present
invention and not in limitation thereof.